## Fine-tuning Llama 3.2 Vision using Trainer

The Transformers Trainer API makes it easy to fine-tune Llama-Vision models. Parameter-efficient fine-tuning techniques also work out of the box thanks to the PEFT integration in transformers. Make sure you have the latest version of transformers installed.


We will fine-tune the model on a small split of the VQAv2 dataset for educational purposes. This dataset consists of images, questions about the images, and short answers. If you want, you can also use a dataset in which each example contains multiple turns of conversation.


In [3]:
from datasets import load_dataset

ds = load_dataset("merve/vqav2-small", split="validation[:10%]")

In [4]:
ds

Dataset({
    features: ['multiple_choice_answer', 'question', 'image'],
    num_rows: 2144
})
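
For the multi-turn case mentioned above, one possible way to flatten a conversation into a single training string is sketched below. It assumes the same Llama 3 header tokens that the preprocessing function later in this notebook uses; `build_conversation` and the example turns are hypothetical and not part of the dataset or of any library.

```python
# Hypothetical helper: flattens a multi-turn conversation into one training string.
# The special tokens mirror the single-turn template used later in this notebook.
def build_conversation(turns, image_in_first_turn=True):
    """turns: list of (role, text) pairs with role "user" or "assistant"."""
    parts = ["<|begin_of_text|>"]
    for i, (role, text) in enumerate(turns):
        if role not in ("user", "assistant"):
            raise ValueError(f"unexpected role: {role}")
        # The image placeholder is placed at the start of the first message only.
        image = "<|image|>" if (i == 0 and image_in_first_turn) else ""
        parts.append(
            f"<|start_header_id|>{role}<|end_header_id|>\n\n{image}{text}<|eot_id|>"
        )
    return "".join(parts)


# Made-up example turns, for illustration only.
example = build_conversation([
    ("user", "What animal is in the picture? Answer briefly."),
    ("assistant", "dog"),
    ("user", "What color is it? Answer briefly."),
    ("assistant", "brown"),
])
print(example)
```

Each example then still carries a single image, so the rest of the pipeline would not need to change under this assumption.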

We have to authenticate ourselves before downloading the model.

In [5]:
from huggingface_hub import notebook_login
notebook_login()


We can now initialize the model and the processor; we will use the processor in our preprocessing function. We will initialize the 11B variant of the vision model.

The Llama authors encourage freezing the text decoder and training only the image encoder. If you would like to try this out, set `FREEZE_LLM` to `True`. Intuitively, if your task is very domain-specific, you might want to avoid this. In that case, you can either try LoRA training (set `USE_LORA` to `True`) or freeze the image encoder (set `FREEZE_IMAGE` to `True`) to save compute. Note that in the code below `USE_LORA` takes precedence over the two freeze flags.


In [None]:
from transformers import MllamaForConditionalGeneration, AutoProcessor
from peft import LoraConfig, get_peft_model
import torch

ckpt = "meta-llama/Llama-3.2-11B-Vision"
USE_LORA = True
FREEZE_LLM = False
FREEZE_IMAGE = False

if USE_LORA:
    lora_config = LoraConfig(
        r=8,
        lora_alpha=8,
        lora_dropout=0.1,
        target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj'],
        use_dora=True, # optional DoRA 
        init_lora_weights="gaussian"
    )

    model = MllamaForConditionalGeneration.from_pretrained(
            ckpt,
            torch_dtype=torch.bfloat16,
            device_map="auto"
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

elif FREEZE_IMAGE:
    if FREEZE_LLM:
        raise ValueError("You cannot freeze image encoder and text decoder at the same time.")
    model = MllamaForConditionalGeneration.from_pretrained(ckpt,
        torch_dtype=torch.bfloat16, device_map="auto")
    # freeze vision model to save up on compute
    for param in model.vision_model.parameters():
        param.requires_grad = False

elif FREEZE_LLM:
    # FREEZE_IMAGE is always False here: the branch above already handles (and rejects) the combined case
    model = MllamaForConditionalGeneration.from_pretrained(ckpt,
        torch_dtype=torch.bfloat16, device_map="auto")
    # freeze text model, this is encouraged in paper
    for param in model.language_model.parameters():
        param.requires_grad = False
        
else: # full ft
    model = MllamaForConditionalGeneration.from_pretrained(ckpt,
        torch_dtype=torch.bfloat16, device_map="auto")

processor = AutoProcessor.from_pretrained(ckpt)
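
To get a feel for why LoRA is cheap, the arithmetic behind `r=8` and `lora_alpha=8` can be sketched for a single linear layer. This is a back-of-the-envelope illustration, not PEFT's implementation: the 4096 x 4096 layer size is an illustrative assumption, `lora_extra_params` is a hypothetical helper, and the DoRA term assumes one extra learnable magnitude per output feature.

```python
def lora_extra_params(in_features, out_features, r, use_dora=False):
    """Rough count of the trainable parameters an adapter adds to one linear layer."""
    # LoRA adds two low-rank matrices: A (r x in_features) and B (out_features x r).
    extra = r * (in_features + out_features)
    if use_dora:
        # Assumption: DoRA additionally learns one magnitude per output feature.
        extra += out_features
    return extra


# Illustrative layer size (assumption), with the rank and alpha from the config above.
in_features = out_features = 4096
r, lora_alpha = 8, 8

full_params = in_features * out_features
lora_params = lora_extra_params(in_features, out_features, r)
dora_params = lora_extra_params(in_features, out_features, r, use_dora=True)
scaling = lora_alpha / r  # factor applied to the low-rank update

print(f"full layer: {full_params:,} parameters")
print(f"LoRA adds:  {lora_params:,} ({lora_params / full_params:.2%} of the layer)")
print(f"DoRA adds:  {dora_params:,}")
print(f"scaling:    {scaling}")
```

For the real model, `model.print_trainable_parameters()` in the cell above reports the actual counts.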

For preprocessing, we will put questions and answers together. Between each question and its answer we insert a conditioning phrase, in this case “Answer briefly.”, which conditions the model and triggers question answering.
To process images, we wrap each image in its own single-element list. This is needed because the processor can accept several images for a single text input, so we have to indicate that each example has exactly one image.
Lastly, we set the labels of pad tokens and image tokens to -100 so that the loss ignores them.


In [None]:
def process(examples):
    # One training string per example: user turn (image, question, conditioning phrase), then the assistant answer
    texts = [f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<|image|>{example['question']} Answer briefly. <|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n{example['multiple_choice_answer']}<|eot_id|>" for example in examples]
    # One single-image list per example, so each text is paired with exactly one image
    images = [[example["image"].convert("RGB")] for example in examples]

    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)
    labels = batch["input_ids"].clone()
    # Positions labeled -100 are ignored by the loss
    labels[labels == processor.tokenizer.pad_token_id] = -100
    labels[labels == 128256] = -100 # image token index
    batch["labels"] = labels
    # Cast floating-point tensors to bfloat16 and move the batch to the GPU
    batch = batch.to(torch.bfloat16).to("cuda")

    return batch
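
The label masking done in `process` can be illustrated on a toy batch in plain Python. This is only a sketch of the idea: `mask_labels`, the pad id of 0, and the token ids are made up for illustration, while -100 is the default `ignore_index` of PyTorch's cross-entropy loss and 128256 is the image token index used above.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_labels(input_ids, ignore_ids):
    """Copy input_ids, replacing every token in ignore_ids with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok in ignore_ids else tok for tok in row]
        for row in input_ids
    ]


PAD_ID = 0         # hypothetical pad id, for illustration only
IMAGE_ID = 128256  # image token index used in the preprocessing function

# Toy batch: the first row is right-padded to the length of the second row.
input_ids = [
    [IMAGE_ID, 11, 12, 13, PAD_ID, PAD_ID],
    [IMAGE_ID, 21, 22, 23, 24, 25],
]

labels = mask_labels(input_ids, {PAD_ID, IMAGE_ID})
counted = sum(tok != IGNORE_INDEX for row in labels for tok in row)

print(labels)
print(f"{counted} of {sum(len(row) for row in labels)} positions contribute to the loss")
```

Note that with this scheme the question tokens still contribute to the loss; only padding and the image placeholder are excluded.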


We can now set up our Trainer. Before that, we will set up the arguments we pass to the Trainer.

In [None]:
from transformers import TrainingArguments
args = TrainingArguments(
    num_train_epochs=2,
    remove_unused_columns=False,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=2,
    learning_rate=2e-5,
    weight_decay=1e-6,
    adam_beta2=0.999,
    logging_steps=250,
    save_strategy="no",
    optim="adamw_hf",
    push_to_hub=True,
    save_total_limit=1,
    bf16=True,
    output_dir="./lora",
    dataloader_pin_memory=False,
)
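
As a rough sanity check on these settings, the number of optimizer steps can be worked out by hand. The sketch below assumes a single training process (`num_devices = 1` is an assumption) and the 2144-row split loaded earlier; the exact rounding the Trainer applies to a trailing partial accumulation step may differ between versions, but here the numbers divide evenly.

```python
import math

num_rows = 2144                   # size of the split loaded earlier
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_train_epochs = 2
logging_steps = 250
num_devices = 1                   # assumption: one training process

# Samples consumed per optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Forward/backward passes per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(num_rows / (per_device_train_batch_size * num_devices))
updates_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
total_updates = updates_per_epoch * num_train_epochs

# How many times the training loss is logged over the whole run.
log_events = total_updates // logging_steps

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer updates:    {total_updates} ({updates_per_epoch} per epoch)")
print(f"log events:           {log_events}")
```

With `warmup_steps=2`, the learning rate reaches its peak almost immediately relative to the total number of updates.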

We can now initialize the Trainer and start training.


In [None]:
from transformers import Trainer
trainer = Trainer(
    model=model,
    train_dataset=ds,
    data_collator=process,
    args=args,
)

Call train.

In [None]:
trainer.train()